ABSTRACT

Several energy functions for synthesizing neural 
networks are tested on 2-D synthetic data and on 
Landsat-4 Thematic Mapper data. These new en- 
ergy functions, designed specifically for minimizing 
misclassification error, in some cases yield significant 
improvements in classification accuracy over the stan- 
dard least mean squares energy function. In addition 
to operating on networks with one output unit per 
class, a new energy function is tested for binary en- 
coded outputs, which result in smaller network sizes. 
The Thematic Mapper data (four bands were used) is 
classified on a single pixel basis, to provide a starting 
benchmark against which further improvements will 
be measured. Improvements are underway to make 
use of both subpixel and superpixel (i.e. contextual 
or neighborhood) information in the processing. For 
single pixel classification, the best neural network re- 
sult is 78.7%, compared with 71.7% for a classical 
nearest neighbor classifier. The 78.7% result also im- 
proves on several earlier neural network results on 
this data. 

INTRODUCTION 

In the past several years, awareness of environmental
crises has gradually grown among the world's nations.
We wish to address automated surveillance technology
for environmental issues. Global warming, ozone
depletion, large-scale deforestation, and the extinction
of species are just a few of the issues that could lead
to serious consequences for all inhabitants of the Earth,
on a scale that will respect no national or political
boundaries. To understand and quantify the anthropogenic
impact on the environment, and to predict the
eventualities if the deteriorating trend is not reversed,
consistent and long-term monitoring of the global
environment is needed. Through the Earth Probes and the
Earth Observing System (EOS), NASA's Mission to
Planet Earth will continue to provide the essential
measurements.

The volume of measurements from the Mission to
Planet Earth, however, will be unprecedented.
For example, the first EOS AM platform alone will
generate more than one terabyte (TB) of data a day,
compared with the 5 TB from the entire 12 years of
AVHRR Pathfinder data. Processing, analyzing, storing,
and disseminating the satellite measurements and
extracted information to a worldwide user community
in a timely manner presents a formidable challenge and
demands innovative analytical methods and advanced
computing and data communication technologies.

Among the contemporary information sciences, 
neural networks have proven to be a versatile tech- 
nique for input-to-output mapping, without the con- 
straint of formulating the exact relationship between 
the two. In addition, contextual and neighborhood 
knowledge can be easily included. In the past few 
years, neural networks have been applied to classifi- 
cations of remotely sensed data (e.g., Campbell et al. 
1989, Decatur 1989, Benediktsson et al. 1990, Liu et 
al. 1991, Bischof et al. 1992, Kiang 1992). In these 
studies, spectral data and ground truth are input to 
multilayer perceptron networks with one or more hid- 
den layers, and networks are extensively trained off- 
line by minimizing a least-mean-squares (LMS) en- 
ergy function with back-propagation (Werbos 1974, 
Rumelhart et al. 1986). It has been shown that the 
performance of neural network techniques is superior 
to classical techniques for systems operating in real- 
time. 

It is well documented that minimizing the LMS en- 
ergy function produces a neural network that approx- 
imates the Bayesian a posteriori probabilities (the 
probability of a class given a particular input vector) 
of classes of data represented by a training set (see 
Richard & Lippmann 1991, for a review). Given an 
infinitely large training set and a network with suf- 
ficient functional complexity, the approximation er- 
ror becomes negligible, and the misclassification er- 
ror converges to the Bayes rate. While this property 
makes the LMS energy function attractive, there is 
an important qualification. The functional complex- 
ity needed for approximating the a posteriori proba- 
bilities is greater than that needed for approximating 
only class boundaries. Thus, if we are only inter- 
ested in the classification of an input, rather than its 
a posteriori probability, a neural network that esti- 
mates probabilities will be needlessly complex. The 
additional complexity is a disadvantage both from 
the principle of parsimony (using the smallest num- 
ber of weight parameters to increase generalization 
[see, e.g., Barron & Cover 1991]) and from the hardware
implementation standpoint. Therefore, we test 
energy functions that minimize the misclassification 
error directly (Szu & Telfer 1991, Telfer & Szu 1992a), 
rather than indirectly via approximating the a poste- 
riori probabilities. We call these Minimum Misclas- 
sification Error (MME) energy functions. 

We first formulate these energy functions and pro- 
vide a two-feature example that illustrates the con- 
cept. The Landsat Thematic Mapper data is de- 
scribed and results are presented for classifying on 
a pixel-by-pixel basis. These results are intended to 
provide a benchmark for further improvements that 
make use of both subpixel and superpixel (contextual)
information. The paper concludes by discussing 
these research directions. 

ENERGY FUNCTION FORMULATION

The commonly used σ-LMS energy function is
given by

$$E_{\sigma\text{-LMS}} = \sum_{n=1}^{N} \sum_{k=1}^{K} \left[ d_{nk} - \sigma(o_{nk}) \right]^2 , \qquad (1)$$

where $d_{nk}$ is the desired output (normally set to 0
or 1) of the $k$-th output unit for the $n$-th training
vector, $\sigma$ is a sigmoidal function [we use
$\sigma(z) = 1/(1 + \exp(-z))$], and $o_{nk}$ is the output
of the $k$-th output unit for the $n$-th training vector,
before the sigmoidal nonlinearity is applied. With one
output unit per class, and $d_{nc} = 1$ for training vectors
from class $c$, $d_{nk} = 0$ otherwise, minimizing
$E_{\sigma\text{-LMS}}$ produces outputs that approximate the
Bayesian a posteriori probabilities. An input vector is
then classified according to the largest output value.
However, for practical applications (finite training sets
and networks with limited functional complexity), the
$E_{\sigma\text{-LMS}}$ function is not guaranteed to minimize
misclassification error (Barnard & Casasent 1989).
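
As a minimal sketch of Eq. 1 and of the largest-output decision rule, the following Python fragment evaluates the σ-LMS energy for pre-sigmoid outputs. The function names and the numerical values are illustrative assumptions and are not taken from the experiments reported here.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def lms_energy(desired, outputs):
    """Eq. 1: sum over training vectors n and output units k of
    [d_nk - sigma(o_nk)]^2, with o_nk taken before the sigmoid."""
    return sum(
        (d - sigmoid(o)) ** 2
        for d_row, o_row in zip(desired, outputs)
        for d, o in zip(d_row, o_row)
    )

def classify(o_row):
    """Return the index of the output unit with the largest value."""
    return max(range(len(o_row)), key=lambda k: o_row[k])

# Illustrative values only: two training vectors, three classes,
# one output unit per class (classes 0 and 2 are the true classes).
desired = [[1, 0, 0], [0, 0, 1]]
outputs = [[2.0, -1.0, -3.0], [0.5, -0.2, 0.1]]

energy = lms_energy(desired, outputs)
labels = [classify(row) for row in outputs]  # second vector is misclassified
```

In this toy case the second vector is assigned to class 0 although its desired class is 2, so the energy is nonzero and the largest-output rule makes one error.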

A more natural energy function for classification 
simply counts the number of training vectors that the 
network misclassifies. The formulation of this count- 
ing function varies depending on the output encoding. 
For a two-class problem, a single output unit suffices, 
with positive outputs indicating one class and negative
outputs indicating the other. A counting function
for this network is given by (Szu & Telfer 1991,
Telfer & Szu 1992a)

$$E_{MME} = N - \sum_{n=1}^{N} \mathrm{step}(d_n o_n), \qquad (2)$$

where $d_n$ is the desired sign of the actual output $o_n$
and step(z) = 1 if z > 0; step(z) = 0 otherwise. (Eq.
2 thus uses a sharp membership function; a fuzzy
logic version would be an obvious extension.) When
the desired sign is the same as that of the actual
output $o_n$, the $n$-th training vector $x_n$ is correctly
classified, the step function equals 1, and the number
of misclassifications $E_{MME}$ is reduced by one. When
the desired output sign and actual output sign differ,
$x_n$ is misclassified, the step function equals 0, and
$E_{MME}$ is not reduced. To minimize an energy function
with gradient descent, the energy function must
be differentiable. Although the step function in Eq.
2 is not differentiable, it can be approximated by a
sigmoidal function that is gradually steepening. As
the magnitudes of the network weights increase, the
magnitudes of the network outputs $o_n$ also increase,
and the sigmoid behaves more and more like the step
function required by Eq. 2.
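
The relation between the counting function of Eq. 2 and its sigmoidal approximation can be sketched as follows. The `gain` parameter is a hypothetical stand-in for the growth of the weight magnitudes during training; it does not appear in Eq. 2, and the sample values are invented for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def step(z):
    """step(z) = 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def mme_count(signs, outputs):
    """Eq. 2: N minus the number of outputs with the desired sign."""
    return len(outputs) - sum(step(d * o) for d, o in zip(signs, outputs))

def mme_smooth(signs, outputs, gain=1.0):
    """Differentiable surrogate: the step is replaced by a sigmoid.
    `gain` (an assumption) mimics outputs growing with the weights."""
    return len(outputs) - sum(
        sigmoid(gain * d * o) for d, o in zip(signs, outputs)
    )

# Illustrative data: desired signs and actual outputs for four vectors.
signs = [+1, -1, +1, -1]
outputs = [0.8, -0.3, -0.5, 0.1]

exact = mme_count(signs, outputs)              # two sign mismatches
loose = mme_smooth(signs, outputs, gain=1.0)   # soft count
tight = mme_smooth(signs, outputs, gain=50.0)  # close to the exact count
```

With a small gain the surrogate is only a soft count; as the gain grows it approaches the exact number of misclassified vectors.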

For multiple classes, if there is one output unit per 
class and an input is classified based on the largest 
output, an appropriate counting function, called the 
Classification Figure of Merit (CFM) (Hampshire &
Waibel 1990), is given by

$$E_{CFM} = N - \sum_{n=1}^{N} \sigma(o_{\max} - o_{\text{other}}), \qquad (3)$$

where $o_{\max}$ is the output from the unit that should
have the maximum value (corresponding to the training
vector's class) and $o_{\text{other}}$ is the largest value of the
other output units. Here the step function has been
replaced by a sigmoid with the above justification.
For a correct classification, $o_{\max} - o_{\text{other}} > 0$ and
$\sigma(o_{\max} - o_{\text{other}}) \approx 1$, and the number of
misclassifications is reduced by 1. For a misclassification,
$o_{\max} - o_{\text{other}} < 0$ and $\sigma(o_{\max} - o_{\text{other}}) \approx 0$,
and the number of misclassifications is not reduced. A proof
showing that minimizing $E_{CFM}$ does give the desired
result is given in (Hampshire & Pearlmutter 1991).
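
A minimal sketch of Eq. 3 for one output unit per class is given below. As before, `gain` is a hypothetical steepness parameter used only to show the limiting behavior, and the outputs are invented values.

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def cfm_energy(labels, outputs, gain=1.0):
    """Eq. 3: N minus the sum of sigma(o_max - o_other), where o_max is
    the output of the true-class unit and o_other is the largest of the
    remaining outputs. `gain` is an illustrative assumption."""
    total = 0.0
    for c, o_row in zip(labels, outputs):
        o_max = o_row[c]
        o_other = max(o for k, o in enumerate(o_row) if k != c)
        total += sigmoid(gain * (o_max - o_other))
    return len(outputs) - total

# Illustrative data: true class index and three output units per vector.
labels = [0, 2, 1]
outputs = [
    [2.0, -1.0, 0.5],   # correct: unit 0 is largest
    [0.3, 0.9, -0.4],   # wrong: unit 1 beats the true unit 2
    [-1.0, 1.5, 0.2],   # correct: unit 1 is largest
]

soft = cfm_energy(labels, outputs)              # smooth energy
sharp = cfm_energy(labels, outputs, gain=50.0)  # approaches the error count
```

For a steep sigmoid the energy approaches 1, the number of vectors whose true-class unit is not the largest.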

With multiple classes, the outputs may also be bi- 
nary encoded labels, in which case the outputs are 
passed through a threshold rather than a maximum 
detector. An advantage of binary encoded outputs 
over one output unit per class is that fewer output 
units are required. For example, for 16 classes, one 
output unit per class requires 16 output units, but binary
encoded outputs require only 4 output units. In
addition, error correcting codes can be used as class
labels. For example, a Hamming code (Lin & Costello
1983) with 7 output units can encode 16 classes and
correct a single error in the output units. Such an
error correcting approach increases classification
accuracy and has been shown to improve associative
memory performance (Liebowitz & Casasent 1986,
Casasent & Telfer 1992). A new MME energy function
to minimize misclassification error for binary
encoded outputs is given by

$$E_{MME} = N - \sum_{n=1}^{N} \sigma\!\left[ \sum_{k=1}^{K} \sigma(d_{nk} o_{nk}) - K + 0.5 \right]. \qquad (4)$$

The summation over $k$ equals the number of correct
output units for the $n$-th training vector. If all are
correct, the summation equals $K$, and the outer sigmoid
becomes 1, which reduces the misclassification count
by 1. If there are one or more output errors, the
summation over $k$ equals at most $K - 1$ (for a single
output error) and the outer sigmoid becomes 0, and the
misclassification count is not reduced. Note that in
this case of multiple classes, $E_{MME}$ must determine
from all the output units whether a classification is
correct or not. It is not sufficient to simply sum the
errors from each output unit individually by summing
Eq. 2 over multiple classes.
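
The difference between counting misclassified vectors (Eq. 4) and counting individual output-unit errors can be sketched as follows. The `gain` parameter is a hypothetical addition that steepens both sigmoids so that the limiting values of 0 and 1 become visible; the signs and outputs are invented for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def mme_binary(signs, outputs, gain=1.0):
    """Eq. 4 for binary encoded outputs. The inner sum softly counts the
    output units with the desired sign; the outer sigmoid is near 1 only
    when all K units are correct. `gain` is an assumption."""
    K = len(signs[0])
    total = 0.0
    for d_row, o_row in zip(signs, outputs):
        n_correct = sum(sigmoid(gain * d * o) for d, o in zip(d_row, o_row))
        total += sigmoid(gain * (n_correct - K + 0.5))
    return len(outputs) - total

# Illustrative data: three output units (K = 3), three training vectors.
signs = [[+1, -1, +1], [-1, -1, +1], [+1, +1, -1]]
outputs = [
    [1.2, -0.7, 0.4],   # all three units correct
    [-0.9, 0.6, 1.1],   # one unit wrong
    [-0.5, 0.8, 0.3],   # two units wrong
]

vector_errors = mme_binary(signs, outputs, gain=50.0)
unit_errors = sum(
    1
    for d_row, o_row in zip(signs, outputs)
    for d, o in zip(d_row, o_row)
    if d * o <= 0
)
```

Here two vectors are misclassified although three output units are in error, which is why summing per-unit errors does not give the misclassification count.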

2-D EXAMPLE 

Before considering the Thematic Mapper data, we 
consider a simpler two-class example of synthetic data 
with two features. This allows the class boundaries 
to be easily visualized to provide insight into LMS 
and MME energy functions. Since the data set is 
much smaller than the Thematic Mapper data, it also 
allows more detailed study. 


Two classes with equal a priori probabilities are
drawn from concentric circular uniform distributions
with radius $\sqrt{2}/2$ (class 1) and 1 (class 2). The Bayes
rate (minimum error) is 0.25, with a circular boundary
of radius $\sqrt{2}/2$. The training set consists of 1000 
vectors from each class and is shown in Figure 1. (The 
class boundaries shown in Figure 1 will be discussed 
shortly.) The test set consists of 5000 vectors from 
each class. The larger test set is needed to increase 
the confidence levels of the results. 
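
The quoted Bayes rate can be checked with a short simulation. The sketch below assumes that each class is uniform over the area of its disk, which is the reading consistent with a Bayes rate of 0.25; the sampler, the seed, and the function names are illustrative and are not the generator used for the reported experiments.

```python
import math
import random

def sample_disk(radius, rng):
    """Draw a point uniformly over the area of a disk (sqrt transform)."""
    r = radius * math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    return (r * math.cos(theta), r * math.sin(theta))

def make_set(n_per_class, rng):
    """Class 1: disk of radius sqrt(2)/2. Class 2: disk of radius 1."""
    inner = math.sqrt(2.0) / 2.0
    data = [(sample_disk(inner, rng), 1) for _ in range(n_per_class)]
    data += [(sample_disk(1.0, rng), 2) for _ in range(n_per_class)]
    return data

def bayes_label(point):
    """Optimal rule: class 1 inside the circle of radius sqrt(2)/2."""
    x, y = point
    return 1 if x * x + y * y < 0.5 else 2

rng = random.Random(0)          # fixed seed, chosen arbitrarily
test_set = make_set(5000, rng)  # same size as the test set in the text
errors = sum(1 for p, label in test_set if bayes_label(p) != label)
error_rate = errors / len(test_set)
```

All errors come from class 2 points that fall inside the inner circle: half of class 2 lies there, and class 2 has prior 0.5, giving 0.25.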

The following study considered $L_1$ and $L_2$ norm
versions of the two-class $E_{MME}$. More details are
provided elsewhere (Telfer & Szu 1992b). The
method described in the formulation section is the
$L_1$ version. Multilayer perceptrons with two layers
of weights and varying numbers of hidden units were
tested for σ-LMS, MME L1 and MME L2. The procedure
was to randomly initialize the weights to values
between ±1, first train each network for 200 iterations
(epochs) using σ-LMS, and then using that result
as a starting point, train for 800 iterations using 
the three energy functions. The motivation for the 
initial 200 iterations was to move the networks into 
a reasonable area of weight space which could then 
be tuned further by each energy function. This was 
found to produce better results than simply start- 
ing with each energy function from random weights. 
Other random weight magnitudes were also tried to 
ensure that the best results possible from each energy 
function were being measured. A conjugate gradient 
method (Fletcher 1987) was used (restart cycle of 5) 
with a simple inexact line search in implementing the 
backpropagation algorithm. For each number of hid- 
den units, ten initial sets of random weights were con- 
structed. In an attempt to discount runs that became 
stuck in local minima, only the run that gave the min- 
imum training set error for each energy function was 
included in the results. 

Figure 2a plots the performance of each energy 
function vs. number of hidden units. The MME 
energy functions produce excellent results with only 
three hidden units, and as more hidden units are 
added, they descend to essentially identical train- 
ing set errors of 0.246 for MME L2 and 0.248 for 
MME L1 with 8 hidden units. Since it was plain 
that the MME energy functions were reliably finding 
minimum error networks, their hidden units were not 
increased beyond 8. For σ-LMS, the training set error 
also slowly decreased with increasing numbers of hid- 
den units, but consistently remained higher than the 
MME training set results and the Bayes rate. With 
16 hidden units, σ-LMS still gave 0.259 error, over 1%
higher than for the MME energy functions. For an
infinitely large training set, σ-LMS would converge
to the Bayes rate, but this result does not hold for a
finite training set.

Figure 1: Training set for 2-D case with class boundaries found by σ-LMS and MME L2 networks with four
hidden units.

Figure 2: (a) Training set and (b) test set results for different energy functions vs. number of hidden units
(vertical axis: misclassification error).

These results are also reflected in the MME L2
and σ-LMS class boundaries with four hidden units,
plotted in Figure 1. The MME L2 boundary is clearly
almost exactly the desired circle, while the σ-LMS
boundary is consistently inside the optimal boundary.

Of course, the more important question is how the
networks performed on the test set. These results are
plotted in Figure 2b with 95% confidence intervals
(Highleyman 1962). The test set errors are all higher
than the respective training set results, as expected.
The MME test set results are still lower than the
σ-LMS results, and the results are statistically
significant (although there is a slight overlap between
the MME L2 and σ-LMS results at 8 hidden units,
even this is still significant with a high but less than
95% confidence level). Even with 16 hidden units,
the σ-LMS result is still significantly (in the statistical
sense) worse than all but 4 of all 8 MME results
with 8 or fewer hidden units. Thus, for this example,
σ-LMS requires roughly five times the number of
hidden units of the MME energy functions (16 vs. 3)
to give equal test set performance.
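
To give a feel for the width of such intervals, the sketch below computes a 95% interval for a test set error rate using the normal approximation to the binomial distribution. This approximation is an assumption made for illustration; the intervals in Figure 2b follow Highleyman (1962) and may have been computed differently. The error rate of 0.25 is a placeholder close to the Bayes rate, and 10,000 is the test set size given above.

```python
import math

def error_interval(error_rate, n, z=1.96):
    """Normal-approximation interval for an error rate measured on n
    independent test vectors; z = 1.96 corresponds to 95% confidence."""
    half = z * math.sqrt(error_rate * (1.0 - error_rate) / n)
    return (error_rate - half, error_rate + half)

# Placeholder error rate near the Bayes rate; 5000 vectors per class.
low, high = error_interval(0.25, 10000)
half_width = (high - low) / 2.0
```

Under these assumptions the half-width is roughly 0.85 percentage points, so differences of about 1% between energy functions are near the limit of what this test set can resolve.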

LANDSAT EXAMPLE 

Description of Data 

Landsat-4 Thematic Mapper (TM) data taken in
July 1982 over an area in the vicinity of Washington,
D.C. were used in this study. The TM is a 7-band
instrument, with spectral coverages (in μm) of 0.45-0.52
(TM1), 0.52-0.60 (TM2), 0.63-0.69 (TM3), 0.76-0.90
(TM4), 1.55-1.75 (TM5), 10.40-12.50 (TM6), and
2.08-2.35 (TM7). The ground Instantaneous Field-of-View
(IFOV) is 30 m, except for the thermal band
(TM6), for which it is 120 m. As the infrared and the
thermal bands had not yet cooled off after launch, only
the first four bands are usable.

The ground truth consists of 17 categories and
was obtained through photointerpretation of color
infrared aerial photographs and subsequent field
visits (Williams et al. 1984). Specifically, the categories
are (1) water, (2) miscellaneous crops, (3) standing
corn, (4) corn stubble, (5) shrubland, (6) grassland
or pasture, (7) soybeans, (8) bare soil/cleared
land, (9) mostly hardwood dense canopy, (10) mostly
hardwood less dense, (11) mostly conifer, (12) mixed
wood, (13) asphalt, (14) single-family residential
area, (15) multi-family residential area, (16) industrial
or commercial area, and (17) bare soil/plowed
fields.

In general, ground truth contains information cat- 
egories instead of spectral categories. As the IFOV 
is broad enough to cover multiple ground categories, 
there are natural overlaps among the spectral signa- 
tures for these categories. Since the neural networks 
in this study perform classifications based on spec- 
tral data alone, whether the information categories 
correspond to distinct spectral categories should be 
examined, in order to estimate the intrinsic discrim- 
inability among the categories. 

To achieve this objective, the spectral signatures 
for all categories are computed. The signatures con- 
sist of mean vectors and covariance matrices. A num- 
ber of measures, such as divergence and Mahalanobis 
distance, could be used to estimate the separabil- 
ity among multi-dimensional clusters. In this study, 
we compute the ratio of between-class variance to 
within-class variance along the Fisher optimal dis- 
criminant vector (Duda & Hart, 1973). From the ra- 
tios, it is concluded that some information categories 
are heavily overlapped with others, and that the 17 
information categories should be combined into 6 cat- 
egories, following the land use and land cover classi- 
fication system of Anderson et al. (1976). These six 
categories are: (1) urban or built-up land, (2) agricul- 
tural land, (3) rangeland, (4) forest land, (5) water, 
and (7) bare soil/cleared land. Notice that there is 
no Category 6 (wetland) in this data. In Anderson’s 
system, Category 7 is barren land, such as salt flats, 
beaches, bare rock, etc. Since bare soil/cleared land 
(Category 17 in the ground truth data) does not ex- 
actly fit the definition, the original description in the 
ground truth is used instead. 
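
For two classes, the separability measure described above can be sketched as follows. The Fisher direction is $w = S_w^{-1}(m_a - m_b)$, where $S_w$ is the sum of the two class covariance matrices, and the ratio is the squared projected mean difference divided by the projected within-class variance. To keep the example dependency-free it uses two features and a hand-written 2x2 solve, whereas the study uses four bands; the sample points and function names are invented for illustration.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def covariance(vectors, m):
    """Sample covariance matrix (divides by n - 1)."""
    n, dim = len(vectors), len(m)
    return [
        [sum((v[i] - m[i]) * (v[j] - m[j]) for v in vectors) / (n - 1)
         for j in range(dim)]
        for i in range(dim)
    ]

def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (a[1][1] * b[0] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - a[1][0] * b[0]) / det,
    ]

def fisher_ratio(class_a, class_b):
    """Between-class over within-class variance along the Fisher vector."""
    ma, mb = mean(class_a), mean(class_b)
    ca, cb = covariance(class_a, ma), covariance(class_b, mb)
    sw = [[ca[i][j] + cb[i][j] for j in range(2)] for i in range(2)]
    delta = [ma[i] - mb[i] for i in range(2)]
    w = solve2(sw, delta)
    between = sum(w[i] * delta[i] for i in range(2)) ** 2
    within = sum(w[i] * sw[i][j] * w[j] for i in range(2) for j in range(2))
    return between / within

# Invented two-feature samples: b is far from a, c overlaps heavily with a.
class_a = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class_b = [(x + 4.0, y) for x, y in class_a]
class_c = [(x + 0.5, y) for x, y in class_a]

well_separated = fisher_ratio(class_a, class_b)
overlapping = fisher_ratio(class_a, class_c)
```

A small ratio, as for the overlapping pair, is the kind of evidence that would suggest merging two information categories.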

To give an idea of the terrain types present, Figure 
3 shows the four bands of the 256x256 image (slightly 
cropped for display purposes). Roads are clearly vis- 
ible. A housing development is at the upper right. 
Fields are visible in the center of the image. The 
dark areas are primarily forest. 

The area for which ground truth exists (a roughly
150x150 area in the center of Figure 3) has 21,952
pixels, with pixels placed alternately into training and
test sets, giving 10,976 pixels for each. The number
of pixels in each class is given in Table 1. Since each
pixel contains four spectral bands, each feature vector
contains four features, with an additional element
set to one to provide a bias term. Each of the four
spectral features was normalized to have zero mean
and standard deviation of 0.75.

Class Name    No. Pixels
Urban         2754
Agric         1670
Range         3184
Forest        13781
Water         28
Bare          535

Table 1: Class distribution of Landsat data.
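
The construction of the feature vectors can be sketched as follows: each band is shifted to zero mean, scaled to a standard deviation of 0.75, and a constant element of one is appended as the bias input. The use of the population standard deviation and the sample digital counts are assumptions made for illustration.

```python
import math

def normalize_features(pixels, target_std=0.75):
    """Scale each band to zero mean and the target standard deviation,
    then append a constant 1.0 as the bias element."""
    n, dim = len(pixels), len(pixels[0])
    means = [sum(p[i] for p in pixels) / n for i in range(dim)]
    # Population standard deviation (divide by n); this is an assumption.
    stds = [
        math.sqrt(sum((p[i] - means[i]) ** 2 for p in pixels) / n)
        for i in range(dim)
    ]
    return [
        [target_std * (p[i] - means[i]) / stds[i] for i in range(dim)] + [1.0]
        for p in pixels
    ]

# Invented four-band digital counts for four pixels.
pixels = [
    [60, 25, 22, 80],
    [72, 30, 35, 40],
    [65, 28, 27, 95],
    [90, 41, 50, 60],
]
vectors = normalize_features(pixels)  # each vector has 4 features + bias
```

Each resulting vector has five elements, which matches the five inputs (four bands plus bias) seen by the first layer of weights.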

Procedure and Results 

Multilayer perceptrons with two layers of weights
and twelve hidden units were tested for $E_{\sigma\text{-LMS}}$,
$E_{CFM}$ and $E_{MME}$. (Networks with fewer hidden
units were also tried but found to perform slightly
worse.) For the network structure of one output unit
per class, six output units were used, while for binary
encoding, three output units were used. (Two of the
possible eight codes were unused.) The procedure
was to randomly initialize the weights to values
between ±1, first train each network for 500 iterations
(epochs) using σ-LMS, and then using that result as
a starting point, train for 1000 iterations using the 
three energy functions. A conjugate gradient method 
was used (restart cycle of 5) with a simple inexact line 
search. 

The resulting classification accuracies are given in
Table 2. We first consider the results for one output
per class. Although CFM improved the σ-LMS
training set accuracy by 1%, the test set results are
identical. The small training set improvement indicates
that σ-LMS is finding class boundaries very
close to the minimum error boundaries. The excellent
σ-LMS performance can be explained by the large
training set size and apparently relatively small
functional complexity needed to represent the a posteriori
probabilities in this case.

For the binary coded outputs, the σ-LMS outputs
estimate the probabilities that the outputs are 1 given
the input. This can be seen to perform worse than
MME, which improves accuracy by 2.2% for the training
set and 1.2% for the test set. The difference in the
test set result is significant, with an SS% confidence
level. (The 95% confidence level is ±0.75%.) There is
no statistically significant difference between the two
test set results for one output per class and MME
binary encoding, but these three results do differ
significantly from the σ-LMS binary encoding result.
Although the saving in weights by binary encoding
is not large in this example, for larger numbers of
classes, the savings become significant. In addition,
the binary encoding performance would be improved
by using error correcting codes.

Energy      Accuracy (%)      Output
Function    Train    Test     Encoding
σ-LMS       78.1     78.7     1/class
CFM         79.1     78.7     1/class
σ-LMS       76.4     76.9     binary code
MME         78.6     78.1     binary code

Table 2: Classification accuracies for Landsat data.

For comparison, a classical nearest neighbor classifier
(Duda & Hart, 1973) gave 71.7% test set accuracy.
Also, seven previous neural network tests (with
various network architectures and sizes) on this data
set have given test set accuracies between 71.6% and
78.4% (Kiang 1992, Hwang et al. 1993). Our best
result of 78.7% is statistically better than all but one
of these previous results (78.4%), and was obtained
with a much smaller network: 132 weights for our
network vs. about 640 weights required for the radial
basis function network giving 78.4%. The fact
that this previous result is similar to our best results
suggests that this could be the best possible accuracy
that can be obtained by classifying single pixels.
Further accuracy improvements can be obtained by
making use of subpixel information and by classifying
based on a neighborhood of pixels. We discuss this
in the next section.
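
The weight counts can be reproduced with a short calculation. The sketch below assumes that the only bias is the constant element appended to the input vector, with no separate bias into the output units; under that assumption the one-output-per-class network has the 132 weights quoted above. The corresponding figure for binary encoding is derived here under the same assumption and is not stated in the text.

```python
def count_weights(n_features, n_hidden, n_outputs, input_bias=True):
    """Weights in a perceptron with two layers of weights. The bias is
    assumed to enter only through the input vector (an assumption that
    reproduces the 132 weights quoted in the text)."""
    n_inputs = n_features + (1 if input_bias else 0)
    first_layer = n_inputs * n_hidden
    second_layer = n_hidden * n_outputs
    return first_layer + second_layer

one_per_class = count_weights(4, 12, 6)  # six output units
binary_code = count_weights(4, 12, 3)    # three output units
```

With 12 hidden units the saving from binary encoding is 12 weights per output unit removed, which grows with the number of classes.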

There have been neural network based studies (e.g.
Bischof et al. 1992) in which classification accuracies
are higher than ours. However, it must be pointed
out that a direct, fair comparison among these studies
may not be possible. As known in remote sensing
applications, classification accuracies are highly
dependent on the ground types involved, the sensors'
resolutions, the seasons when the measurements were
taken and the environmental conditions. In general,
discrimination among various kinds of vegetation
covers is rather difficult.
DISCUSSION OF FUTURE WORK

The 78.7% classification accuracy for single pixel 
classification should be regarded as a starting point to 
benchmark further improvements that involve both 
subpixel and superpixel information. In addition to 
this work, a method for improving the training set is 
also discussed. 

Since the Thematic Mapper pixel footprint is 30 m,
the spectra from different land use types can be mixed
in a single pixel. In related work (Shimabukuro &
Smith 1991), mixture components are estimated using
conventional least squares techniques in order to
estimate ages of eucalyptus areas. Neural network
approaches remain to be tested. Since a neural network
trained by LMS estimates a posteriori probabilities,
these can be used as mixing proportions to
provide subpixel classification results. For example,
it was observed in the LMS classification results
described above that many of the pixels along a road
passing through forest had large outputs corresponding
to both urban (manmade) and forest. Rather
than classifying the pixel as urban (road) or forest
based on only the single largest output, it seems more
appropriate to classify the pixel as a certain fraction
urban/road and a certain fraction forest based
on the two largest mixing components. Simply
classifying based on the largest output was observed to
create many discrepancies with ground truth. For
example, the ground truth marks only discontinuous
stretches of the road as urban and the rest as forest.
The LMS neural network classifies (based on
largest output) the entire stretch of road as urban,
but also has a high second largest output for forest.
Thus, making use of subpixel mixtures should improve
results. Mixture information provides general
information about a pixel, but does not indicate the
physical region within the pixel occupied by a particular
ground type. Super-resolution theory appears
promising for physically locating ground types within
pixels based on the classifications of nearby pixels.
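
One simple way to turn the two largest outputs into mixing fractions is sketched below. Renormalizing the two largest outputs so that they sum to one is an assumption made for illustration, as are the class ordering and the output values; this is a proposal-level sketch and has not been tested on the Landsat data.

```python
# Assumed ordering of the six output units (one per class).
CLASSES = ["urban", "agricultural", "rangeland", "forest", "water", "bare"]

def top_two_mixture(outputs, names=CLASSES):
    """Return the two classes with the largest outputs, with their
    outputs renormalized to sum to one (an illustrative choice).
    Outputs are assumed to be positive, as sigmoid outputs are."""
    ranked = sorted(range(len(outputs)), key=lambda k: outputs[k], reverse=True)
    first, second = ranked[0], ranked[1]
    total = outputs[first] + outputs[second]
    return {
        names[first]: outputs[first] / total,
        names[second]: outputs[second] / total,
    }

# Invented post-sigmoid outputs for a road pixel passing through forest.
road_pixel = [0.55, 0.02, 0.03, 0.40, 0.00, 0.01]
mixture = top_two_mixture(road_pixel)
```

For this invented pixel the result is a mixture of roughly 58% urban and 42% forest, rather than a hard urban label.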

Conversely, since land use occurs in patches larger
than the 30 m pixel size, it seems clear that information
from neighboring pixels should also increase
classification accuracy. Several such ideas for making
use of context have been tested with conventional
classifiers (Mohn et al. 1987 [tests several prior
approaches], Lee & Philpot 1991, Jeon & Landgrebe
1992) and a neural network approach (Bischof et al.
1992). The neural network approach combines spectra
from the pixel to be classified and from neighboring
pixels into a single feature vector. The neural
network then learns from the training set how much
weight should be placed on information from neighboring
pixels in classifying the central pixel. Bischof
et al. demonstrated a 5% improvement with this
method vs. single pixel classification. We are currently
testing this contextual technique with our new
MME energy functions. A two-pass hybrid spectral/spatial
approach is also planned to overcome projection
registration and distortion problems.
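
The construction of a contextual feature vector can be sketched as follows: the band values of the central pixel and of its neighbors are concatenated into one vector. The 3x3 window, the clamping at the image border, and the tiny test image are assumptions made for illustration and do not describe the implementation of Bischof et al.

```python
def context_vector(image, row, col, radius=1):
    """Concatenate the band values of all pixels in a square window
    centered on (row, col). image[r][c] is a list of band values.
    Border pixels are handled by clamping indices (an assumption)."""
    n_rows, n_cols = len(image), len(image[0])
    features = []
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r = min(max(row + dr, 0), n_rows - 1)
            c = min(max(col + dc, 0), n_cols - 1)
            features.extend(image[r][c])
    return features

# Invented 3x3 image with two bands per pixel.
image = [
    [[10 * r + c, 100 + 10 * r + c] for c in range(3)]
    for r in range(3)
]

center = context_vector(image, 1, 1)  # 9 pixels x 2 bands = 18 features
corner = context_vector(image, 0, 0)  # border handled by clamping
```

With the four usable TM bands, a 3x3 window would give 36 spectral features per pixel before the bias element is added.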

Lastly, editing the training set should also help
improve results. As noted elsewhere (Williams et
al. 1984), any minor errors registering ground truth
with the Thematic Mapper data could result in mislabeled
samples. Therefore, training samples near class
boundaries in the image should be deleted.
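
One possible editing rule is sketched below: a pixel is kept for training only if all of its immediate neighbors carry the same ground truth label. The one-pixel margin and the small label map are assumptions made for illustration; the appropriate margin would depend on the size of the registration errors.

```python
def interior_pixels(labels):
    """Return (row, col) for every pixel whose neighbors within one
    pixel (including diagonals) all share its ground truth label."""
    n_rows, n_cols = len(labels), len(labels[0])
    keep = []
    for r in range(n_rows):
        for c in range(n_cols):
            same = all(
                labels[rr][cc] == labels[r][c]
                for rr in range(max(r - 1, 0), min(r + 2, n_rows))
                for cc in range(max(c - 1, 0), min(c + 2, n_cols))
            )
            if same:
                keep.append((r, c))
    return keep

# Invented label map: forest (4) on the left, urban (1) on the right.
labels = [
    [4, 4, 4, 1, 1],
    [4, 4, 4, 1, 1],
    [4, 4, 4, 1, 1],
    [4, 4, 4, 1, 1],
]
kept = interior_pixels(labels)  # columns 2 and 3 touch the boundary
```

In this example the two columns adjacent to the forest/urban boundary are removed and the remaining 12 pixels are retained.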